Deep Joint Learning of Pathological Region Localization and Alzheimer's Disease Diagnosis

The identification of Alzheimer's disease (AD) and its early stages using structural magnetic resonance imaging (MRI) has been attracting the attention of researchers. Various data-driven approaches have been introduced to capture subtle and local morphological changes of the brain accompanied by the disease progression. One of the typical approaches for capturing subtle changes is patch-level feature representation. However, the predetermined regions to extract patches can limit classification performance by interrupting the exploration of potential biomarkers. In addition, the existing patch-level analyses have difficulty explaining their decision-making. To address these problems, we propose the BrainBagNet with a position-based gate (PG-BrainBagNet), a framework for jointly learning pathological region localization and AD diagnosis in an end-to-end manner. In advance, as all scans are aligned to a template in image processing, the position of brain images can be represented through the 3D Cartesian space shared by the overall MRI scans. The proposed method represents the patch-level response from whole-brain MRI scans and discriminative brain-region from position information. Based on the outcomes, the patch-level class evidence is calculated, and then the image-level prediction is inferred by a transparent aggregation. The proposed models were evaluated on the ADNI datasets. In five-fold cross-validation, the classification performance of the proposed method outperformed that of the state-of-the-art methods in both AD diagnosis (AD vs. normal control) and mild cognitive impairment (MCI) conversion prediction (progressive MCI vs. stable MCI) tasks. In addition, changes in the identified discriminant regions and patch-level class evidence according to the patch size used for model training are presented and analyzed.


Introduction
Alzheimer's disease (AD) is a type of neurodegenerative brain disorder characterized by an irreversible and progressive loss of neurons (Jagust, 2013). This degenerative disorder is also considered the most common cause of dementia (Barker et al., 2002), which is commonly accompanied by symptoms such as memory loss and progression toward long-term impairment of cognitive functioning. However, these symptoms are not manifest in the early stages of AD, and they gradually worsen without being recognized as signs of the condition (Larsen, 2019). Moreover, AD primarily progresses through a prodromal stage referred to as mild cognitive impairment (MCI). Thus, in the past few decades, numerous studies (Mosconi et al., 2007; Gray et al., 2013; Liu et al., 2014; Rathore et al., 2017; Arbabshirani et al., 2017; Jung et al., 2021) have focused on both accurately classifying the AD group from the normal control (NC) group and predicting the MCI conversion/transition to detect the early stages of AD. Typically, the MCI conversion prediction task is to distinguish between stable MCI (sMCI) and progressive MCI (pMCI) based on the risk of AD progression.
These studies on AD diagnosis and its early detection can help identify high-risk cohorts for better treatment planning and to further improve the quality of life (Fung et al., 2019).
Although the symptoms that appear in AD are not manifested in the early stages, the biological processes underlying the disease are present for decades before the symptoms occur (Bennett et al., 2006; Jagust, 2018). Therefore, neuroimaging data, such as magnetic resonance imaging (MRI), have been employed for examining neurodegeneration, diagnosing AD, and detecting its early stages (Frisoni et al., 2010; Wolz et al., 2011; Coupé et al., 2012). Furthermore, brain atrophy is the most proximate substrate of cognitive impairment in AD; thus, automated computer-aided diagnosis based on structural MRI (sMRI) has attracted attention as a promising research direction (Vemuri and Jack, 2010; Liu et al., 2020; Tanveer et al., 2020). In addition, explaining diagnostic results as clinical evidence is crucial for clinical application (Lee et al., 2019; Eitel et al., 2019).
Most existing 3D CNN-based AD diagnostic models (Korolev et al., 2017; Fung et al., 2019; Jin et al., 2019; Wang et al., 2020; Liu et al., 2020, 2018a; Lian et al., 2018, 2020) can be categorized into two classes according to the input type. One option is to take a 3D whole-brain image as input. The other option is to take a bag of instances in the multiple instance learning (MIL) terminology, where each 3D whole-brain image is regarded as a bag, and the 3D patches extracted from the bag are treated as instances. When models take whole-brain images as input, features are hierarchically extracted, from local to global patterns, for an accurate AD diagnosis. The models gradually expand the size of the region that produces the features (i.e., the receptive field), and the size eventually becomes comparable to the input image size. In this process, local and global structural patterns are captured and employed for making decisions.
Recent studies taking this practical feature representation approach have focused on better feature extraction by proposing novel network architectures. For instance, as AD progression accompanies local and subtle brain changes, Liu et al. (2020) proposed a network architecture that avoids early spatial downsampling so that it can learn subtle differences. In addition, Jin et al. (2019, 2020) introduced an attention-based CNN model to automatically generate more discriminative feature representations in brain images. However, deep learning methods are criticized for the difficulty of retracing the classification decision due to the vast parameter spaces and nonlinear interactions (Castelvecchi, 2016). In particular, in medical applications, interpretation of the decision-making (e.g., which brain regions contribute to decision-making) is essential for clinical integration, error tracking, and knowledge discovery (Eitel et al., 2019). Although existing models (Böhle et al., 2019; Eitel et al., 2019; Liu et al., 2020) have mitigated this problem by relying on additional applications that generate heatmaps for each subject, the voxel-wise relevance represented by heatmaps does not allow making a statement about the underlying reasons such as brain atrophy or pathological structural changes (Böhle et al., 2019).
In the case of taking a bag of instances as input, various approaches (Liu et al., 2012b; Suk et al., 2014; Tong et al., 2014) for patch-level feature representation have attempted to more efficiently characterize local structural changes induced by AD. In addition, feature representation using the CNN model (Liu et al., 2018a; Lian et al., 2018, 2020) has recently been proposed. These approaches are intended to capture local and subtle structural changes by allowing models to extract features from a specific receptive field size with an upper bound on the patch size. Conventional methods (Coupé et al., 2012; Zhang et al., 2017) generally assign the class label of an image to all patches extracted from the corresponding image to extract patch-level feature representation. However, the class labels for each patch are ambiguous because not all patches extracted from patients' brains include changes associated with pathology (Tong et al., 2013). Moreover, the proportion of AD-induced brain atrophy might vary according to the individual patient. In this circumstance, weakly supervised learning strategies, such as MIL, have been adopted (Tong et al., 2014; Liu et al., 2018a). A given brain MRI scan is considered a bag, and the 3D patches that comprise the bag are treated as instances. The simplest method to extract the patches is to use all of the patches over the brain. However, the number of patches that can be extracted from 3D high-resolution brain images is too large. In addition, including numerous patches not associated with AD can lead to a low proportion of instances containing AD pathological changes in bags labeled as the AD class. This low proportion can cause serious class imbalance problems and degrade performance for many real-world problems (Carbonneau et al., 2016, 2018). Therefore, localizing the pathological brain regions and extracting patches from these regions is an important and challenging problem.
For patch extraction, existing studies have introduced data-driven approaches (Liu et al., 2012b; Tong et al., 2014; Liu et al., 2018a,b), which commonly follow a standard pipeline. First, a discriminative probability map is generated by using group differences of local features. Then, based on the probability map, patches are extracted according to various hyperparameters, such as patch size, number of patches, and the minimum distance between patches. These approaches have successfully extracted class-discriminative patches and presented informative results, leading to a substantial increase in the classification performance of the diagnostic model. However, the predetermined area to extract patches can lead to suboptimal classification performance because it is determined independently, without considering feature extraction and classification. In addition, the hyperparameters for patch extraction eventually require time-consuming exploration or optimization procedures, in which the diagnostic model training is repeated for numerous hyperparameter combinations.
To alleviate issues arising from the independence of patch extraction and diagnostic model training, a hierarchical fully convolutional network (H-FCN) was proposed for joint atrophy localization and disease identification (Lian et al., 2018). A hybrid loss was designed by gathering patch-, region-, and image-level losses for diagnostic model training and for ranking the discriminative capacity of the corresponding location. However, the hybrid loss includes an objective such that the patch- and region-level classification scores become closer to the image-level ground-truth class labels, and this objective might rest on a faulty assumption because not all patches and regions might have been affected by AD. In addition, the pruning approach cannot consider potential patch-level biomarkers not previously extracted as candidates. More recently, a hybrid network (HybNet) (Lian et al., 2020) improved the H-FCN by fusing two branches, which extract local and global structural information, respectively. Although a novel discriminative brain-region localization approach has been introduced for patch extraction, the fundamental problem caused by brain-region predetermination remains. In addition, as the proposed learning pipeline becomes increasingly complex, the prediction results remain difficult to explain despite competitive performance.
The current methods for patch-level feature representation have evolved from various perspectives based on predetermined brain-region localization. However, the dependency between discriminative brain-region localization, patch extraction, and the diagnostic model has usually been overlooked. For instance, if diagnostic model predictions were made using 3D patches as large as a whole-brain image, most patches could contain pathological changes wherever the patches were extracted. In contrast, as 3D patches become smaller, the learning model can efficiently characterize subtle brain changes, whereas patches with pathological changes can only be detected in sparse areas of the brain. Therefore, predetermination of a brain region to extract patches may limit the opportunity to explore potential biomarkers or lead to a low proportion of instances containing AD pathological changes in bags labeled as the AD class. A joint learning framework for discriminative brain-region localization and disease diagnosis can effectively prevent these problems.
To provide the rationale for decision-making, Melendez et al. (2014, 2015) attempted to combine instance-level responses for bag classification, which could highlight abnormalities in the image. However, MIL methods often show poor and inconsistent accuracy at the instance level (Kandemir and Hamprecht, 2015; Cheplygina et al., 2015). One plausible reason for this is that MIL models may detect only the most abnormal parts of the image where multiple abnormalities are present (Cheplygina et al., 2019). Recently, to alleviate the issue, Ilse et al. (2018) introduced an attention-based MIL. The attention mechanism was used to determine the instances that trigger the bag label, called key instances (Liu et al., 2012a). By analogy with attention-based MIL, discriminative brain-region localization in AD studies can be considered a type of key instance detection that localizes the brain regions where the bag label is triggered.
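For reference, the attention pooling of Ilse et al. (2018) can be written as follows. The notation is the one commonly used for that method and is not defined elsewhere in this paper: h_k is the embedding of the k-th of K instances in a bag, w and V are learnable parameters, a_k is the attention weight, and z is the resulting bag representation.

```latex
a_k = \frac{\exp\left\{ \mathbf{w}^{\top} \tanh\left( \mathbf{V} \mathbf{h}_k^{\top} \right) \right\}}
           {\sum_{j=1}^{K} \exp\left\{ \mathbf{w}^{\top} \tanh\left( \mathbf{V} \mathbf{h}_j^{\top} \right) \right\}},
\qquad
\mathbf{z} = \sum_{k=1}^{K} a_k \mathbf{h}_k .
```

Instances that receive large weights a_k are the ones interpreted as key instances.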
We propose the BrainBagNet with a position-based gate (PG-BrainBagNet) for joint learning of pathological region localization and disease diagnosis. As illustrated in Fig. 1, the proposed method starts with two branches, called the patch-level prediction branch and the position-based gating branch. Given an MRI scan, the patch-level prediction branch extracts local features from a specific receptive field size (i.e., 3D patches) and produces patch-level responses. As all MRI scans are aligned to a 3D template in image processing, the position information in 3D space can indicate brain regions. The position information of the brain image is represented via coordinates in 3D Cartesian space. Then, given the position information, the position-based gating branch identifies the brain regions where the class-discriminative responses consistently appear across subjects. Whereas the existing approaches select patches before learning feature representation, the proposed position-based gating branch learns pathological region localization by linearly interacting with patch-level responses. By leveraging the soft region proposals generated by the position-based gating branch, patch-level responses obtained in the patch-level prediction branch are converted to patch-level class evidence, which is transparently aggregated to determine an image-level response. Transparent aggregation for the image-level response alleviates the difficulty of interpretation due to global and nonlinear interactions. We evaluated the effectiveness of the proposed method on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. In addition, we conducted extensive ablation studies in both AD vs. NC and pMCI vs. sMCI binary classification tasks, called AD diagnosis and the MCI conversion prediction task, respectively. The proposed method outperformed the comparison methods in both AD diagnosis and MCI conversion prediction tasks. We identified the discriminative brain regions and patch-level class evidence in a weakly supervised learning manner and analyzed the changes according to patch size.

CNN-based Alzheimer's Disease Diagnosis
The development of deep learning methods, including the CNN, has efficiently replaced multistep pipelines of handcrafted feature generation/extraction and logistic regression by training a model in an end-to-end manner (Korolev et al., 2017). Thus, studies on accurate AD diagnosis based on the 3D CNN are underway by taking 3D whole-brain images as input (Basaia et al., 2019; Fung et al., 2019; Liu et al., 2020). In particular, various architectures have been proposed for accurate AD diagnosis (Korolev et al., 2017; Wang et al., 2020; Jin et al., 2019, 2020). For instance, Liu et al. (2020) demonstrated the changes in disease identification performance according to various factors, such as the normalization layer, kernel size, network architecture width, and patient age. Regarding network architecture, the model introduced by Liu et al. (2020) was designed to learn subtle changes in the brain by limiting the size reduction of feature maps during low-level feature extraction steps.
Moreover, the attention mechanism gradually became popular and widely employed in CNN-based image recognition models to better explain network behavior and generate more discriminative feature representations (Zhang et al., 2019; Jetley et al., 2018; Schlemper et al., 2019). In AD analysis, Jin et al. (2019, 2020) introduced an attention-based 3D convolutional network for disease identification and biomarker exploration. The network architecture was designed based on the ResNet (He et al., 2016), and the attention module was embedded in the middle of the network. By taking the extracted local features as input, the attention module produced spatial weights. As described in the literature, the goal of the attention module is to represent the regional importance during end-to-end training. Moreover, in backward propagation, the produced attention could work as a gradient filter.
As the early AD stages could only be identified using subtle local pathological cues, patch-level feature representations have been investigated to capture subtle local pathological changes more efficiently. However, only a few brain regions could contain cues for disease identification; thus, an additional discriminative patch extraction procedure was required in advance. Employing prior anatomical knowledge and the discriminative probability yielded by statistical approaches was the typical initial step for extracting discriminative patches (Tong et al., 2013). Patch extraction has been improved through resampling schemes using Elastic Net (Janoušová et al., 2012; Tong et al., 2014). Recently, a landmark discovery algorithm for AD diagnosis was introduced for discriminative patch extraction (Liu et al., 2018a). The algorithm started with a multivariate statistical test on training images that had been processed with nonlinear registration. A p-value map was obtained in the template space. Based on the p-value map, landmarks were determined according to the size and number of patches and the minimum distance between patches. Patches extracted at the landmarks were used for diagnostic model training. Moreover, as many 3D CNNs as the number of determined landmarks were configured to extract features for each patch. The subsequent fully connected layers employed the concatenated patch-level representations for bag-level prediction.
However, discriminative region localization and patch extraction were performed independently of the diagnostic model, which could result in suboptimal diagnostic performance (Lian et al., 2018). To alleviate this limitation, Lian et al. (2018) proposed a hybrid loss and pruning strategy based on the H-FCN. The proposed model extracted multiscale feature representations (i.e., patch-, region-, and subject-level features) by employing the CNN. This hierarchical construction of the network architecture allows the trained model to identify the most informative patches and regions through the hybrid loss. Pruning less discriminative areas identified by the hybrid loss could improve diagnostic results. However, the hybrid loss was defined by considering all patch- and region-level features belonging to the patient's MRI images as positive samples, although not all patches and regions would necessarily be affected by AD. In addition, this approach still relies on predetermined landmarks for better classification performance in the initial stage.
Recently, HybNet (Lian et al., 2020) was proposed to improve the H-FCN by considering both global and local structural information. Specifically, two branches were constructed: the global branch (GB) and local branch (LB). The subject-specific and intersubject-consistent discriminative region localization approaches were applied in the GB and LB, respectively. Both localizations were extracted by a pretrained fully convolutional network (FCN) backbone. The FCN backbone was trained for weakly supervised object localization (WSOL) and could generate a class activation map (Zhou et al., 2016). Disease attention maps (DAMs) were produced to represent subject-specific AD-related brain regions based on the class activation map outcomes. Moreover, the mean DAM was calculated by simply averaging the DAMs produced from a considerable number of samples in the training set. Given the localization results in advance, the GB and LB were trained. The GB used DAMs, which represent subject-specific discriminative brain regions, as the spatial attention. In contrast, the LB was trained on patches extracted using the intersubject-consistent discriminative brain regions represented by the mean DAM. The patch extraction was performed identically to the H-FCN, but the mean DAM was used instead of the discriminative probability map obtained by the statistical test.
Moreover, this study indicated that the feature representations extracted from both branches could be complementary, and their fusion could improve classification performance. Although additional information was used to address the shortcomings of H-FCN, the predetermination of patches may hamper the effectiveness of end-to-end learning of local feature extraction and diagnosis.

AD-associated Brain-region Localization
The AD-associated brain-region localization methods have been proposed and developed for various purposes. Specifically, these methods boost the classification performance of diagnostic models, detect potential biomarkers in AD diagnosis, and better explain the behavior of deep learning networks. These brain-region localization methods can be divided into two categories according to the information used in the localization: feature-based and position-based approaches.
First, feature-based approaches produce the brain-region localization result based on individual local features extracted from each brain image. These can be further divided into supervised learning and weakly supervised learning approaches based on the learning strategy. For supervised learning approaches, a relatively large patch or region of interest (ROI) extracted from the image was assigned the same annotation as the image-level annotation (Lee et al., 2019; Qiu et al., 2020). A model was trained to represent regional abnormalities by subject. Regional outcomes were used as features to estimate the individual disease states.
Although the identified regional abnormalities can provide clinical evidence, the evidence was relatively coarse. In addition, this supervised approach assumes that all regions in patients are affected by AD. This limitation has recently led to the application of weakly supervised learning and MIL.
Regarding weakly supervised learning approaches, Zhou et al. (2016) proposed a representative WSOL method through an FCN with a global average pooling (GAP) layer and a linear classifier. Due to the linear property of the GAP and linear classifier, areas that contributed significantly to the predictions can be tracked. This approach has evolved from several perspectives (Singh and Lee, 2017; Yun et al., 2019), and Brendel and Bethge (2019) proposed bag-of-local-feature models, which provide patch-level class evidence by limiting the receptive field size of the topmost feature maps. In the AD study, WSOL was employed in (Lian et al., 2018) to localize AD-related structural abnormalities at a finer scale by training an additional 3D FCN.
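The linearity argument behind class activation maps can be checked numerically. The following is a minimal NumPy sketch and not the implementation of any cited work: the feature-map shape, the random weights, and all variable names are illustrative assumptions. It shows that the logit obtained by GAP followed by a linear classifier equals the spatial mean of the class activation map plus the bias, so the contribution of each location can be read off directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical topmost feature maps of one 3D image: (w, h, d, f).
features = rng.normal(size=(4, 5, 4, 8))
# Hypothetical linear classifier for a single class: weights (f,) and a bias.
weights = rng.normal(size=8)
bias = 0.1

# Standard path: global average pooling, then the linear classifier.
pooled = features.mean(axis=(0, 1, 2))      # shape (f,)
logit = float(pooled @ weights + bias)

# Class activation map: apply the same weights at every spatial location.
cam = features @ weights                    # shape (w, h, d)

# Because GAP and the classifier are both linear, the two paths agree.
logit_from_cam = float(cam.mean() + bias)
```

Each entry of `cam` is the contribution of one location, scaled by the number of locations, to the final logit.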
Moreover, WSOL was applied to represent the regional importance for better feature representation (Li et al., 2019; Lian et al., 2020). Li et al. (2019) proposed an iterative learning framework leveraged by the localization result generated by WSOL. Further, Lian et al. (2020) introduced a subject-specific discriminative brain-region localization called a DAM. For a similar purpose, an attention mechanism can be attached to a diagnostic model. Jin et al. (2019, 2020) proposed an attention-based diagnosis model for joint learning of discriminative brain-region localization and disease identification.
Unlike feature-based region localization, position-based brain-region localization methods detect regions where significant differences appear between the AD and NC groups. The identified brain regions are consistent across subjects, so this method could be called intersubject-consistent discriminative region localization (Lian et al., 2020). All sMRI scans are aligned to the same template in preprocessing; thus, all samples share the same 3D space. This shared space allows a group comparison of local features, and a statistical test can generate a probability map representing the discriminative capacity. In particular, this position-based localization method has been widely used in patch extraction for patch-level feature representation. Data-driven pathological brain-region localization approaches have continued evolving, as described in the previous section, through the statistical approach (Tong et al., 2014; Suk et al., 2014), landmark discovery (Tong et al., 2014), pruning strategy (Lian et al., 2018), and mean DAM (Lian et al., 2020). However, the existing patch extraction methods are performed independently of image-level diagnostic model outcomes.
Inspired by the recent patch-level analysis in AD diagnosis, we propose a framework that jointly learns pathological region localization and disease identification in an end-to-end manner. In addition, final decision-making is conducted through the transparent aggregation of the patch-level responses, providing patch-level class evidence for decision-making. To the best of our knowledge, this framework is the first for joint learning of position-based discriminative brain-region localization and disease identification in an end-to-end manner.

Subjects and Image Processing
We used two public datasets (ADNI-1 and ADNI-2) from the ADNI cohort. First, we collected the baseline brain sMRI scans and the diagnostic information from the datasets. Then, we removed from ADNI-2 the scans of subjects that appear in both ADNI-1 and ADNI-2, so that our dataset contains one sMRI scan per subject.
The disease state of the collected scans was categorized into three classes: NC, MCI, and AD. We further divided the MCI subjects into two classes for the MCI conversion prediction task. If the patient corresponding to a baseline image had not been diagnosed with an AD class by 72 months, the image was labeled as the  1.
The prepared brain scans were processed using the following pipeline. First, brain extraction was performed with the HD-BET brain extraction tool (Isensee et al., 2019) to remove non-brain areas (e.g., the neck and skull). Then, an affine registration was performed to linearly align each sMRI scan to the MNI152 template. This step was carried out with the FLIRT method in the FSL package. It removed global linear differences between the images, such as global translation, scale, and rotation differences. Furthermore, it gave all images an identical spatial resolution (1 × 1 × 1 mm³). Finally, the size of the processed brain scans was 193 × 229 × 193, and each image was normalized using its own mean and standard deviation.
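The final normalization step can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the text does not say whether the statistics are computed over all voxels or only over brain voxels, so the hypothetical function `normalize_scan` uses every voxel, and the small `eps` term is added only to avoid division by zero. A small synthetic volume stands in for a registered 193 × 229 × 193 scan.

```python
import numpy as np

def normalize_scan(scan, eps=1e-8):
    """Z-score a single scan with its own mean and standard deviation.

    Assumption: statistics are taken over every voxel of the volume,
    including background; `eps` only guards against division by zero.
    """
    scan = np.asarray(scan, dtype=np.float32)
    return (scan - scan.mean()) / (scan.std() + eps)

# Small synthetic stand-in for a registered, skull-stripped scan.
rng = np.random.default_rng(0)
scan = rng.uniform(0.0, 255.0, size=(32, 40, 32)).astype(np.float32)
normalized = normalize_scan(scan)
```

After this step each scan has approximately zero mean and unit standard deviation, independently of the other scans.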

Overview of Methods
An overview of the proposed PG-BrainBagNet is presented in Fig. 1. There are four parameterized networks: the encoder, classifier, position embedding, and gate network. These networks are organized in two branches, and each branch takes a different input: X and I. The input MRI scan X can be considered a set of 3D patches. Moreover, I is a representation of the patch position information: the position indicator.
The patch-level prediction branch comprises encoder and classifier networks. First, we construct an encoder network that can adjust the patch size for feature extraction according to the receptive field size of the top-level feature maps. The constructed encoder takes the whole brain image X ∈ R^{W×H×D×1} as input and extracts the local features from 3D patches of size s × s × s. Given the patch-level features X, the classifier network produces patch-level responses X ∈ R^{w×h×d×1}. Thus, we obtain responses from w × h × d patches.
The MRI scans were processed by linear affine registration to a template; thus, all images share a 3D Cartesian space. Through the 3D Cartesian coordinate system, a patch located at (i, j, k) can be defined as x^s_{i,j,k}, where (i, j, k) denotes the coordinates in 3D space, as described in Fig. 2a. To reflect the position information in the patch-level representation, we represent coordinates in 3D Cartesian space as a 4D tensor, inspired by prior work (Liu et al., 2018c), and introduce a scheme for extracting position information of patch-level responses, as presented in Fig. 2b. First, coordinates in 3D Cartesian space are represented as a 4D tensor I ∈ R^{W×H×D×3}, which consists of three channels (i.e., the coronal, sagittal, and axial axes).
Lastly, element-wise multiplication of the outcomes obtained from the two branches generates patch-level class evidence. The image-level disease state is inferred by the transparent aggregation of the patch-level class evidence. This process leads the model to be trained to detect patch-level class evidence by jointly learning AD-related local morphological changes and the regions where the discriminative changes consistently appear. Moreover, the patch-level class evidence indicates which patches contributed significantly to the model prediction. In the following section, we present the details of these steps.
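The combination of the two branches can be illustrated with a minimal NumPy sketch. Several points are assumptions made only for illustration and are not specified in this overview: the gate values are taken to lie in [0, 1], the transparent aggregation is taken to be a plain sum over patches, a sigmoid maps the image-level response to a probability, and random arrays stand in for the outputs of the trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
w, h, d = 6, 7, 6  # illustrative grid of patch positions

# Patch-level responses from the patch-level prediction branch (stand-in values).
responses = rng.normal(size=(w, h, d))
# Soft region proposals from the position-based gating branch.
# Assumption: gate values lie in [0, 1]; random values stand in for the network.
gate = rng.uniform(0.0, 1.0, size=(w, h, d))

# Patch-level class evidence: element-wise product of the two branches.
evidence = responses * gate

# Assumed transparent aggregation: a plain sum, so the image-level response
# decomposes exactly into per-patch contributions.
image_logit = float(evidence.sum())
probability = 1.0 / (1.0 + np.exp(-image_logit))

# The patch with the largest absolute contribution can be read off directly.
top_patch = np.unravel_index(np.argmax(np.abs(evidence)), evidence.shape)
```

With a linear aggregation of this kind, every patch's share of the image-level response is available without any post hoc attribution method.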

Patch-level Features Extraction and Classification
To handle local and morphological changes distributed in the whole brain, we constructed an encoder E^s_φ and a classifier network C_ψ parameterized with φ and ψ, respectively. These networks comprise convolutional layers, which share learning parameters across the spatial dimensions. Thus, we can effectively extract patch-level feature representations and patch-level responses with whole brain images as input by controlling the receptive field size of the topmost feature maps. In particular, the kernel and stride sizes of the convolutional operators allow adjusting the patch size and the distance between patches. We followed this principle, introduced in BagNets (Brendel and Bethge, 2019), to construct the network architectures. Based on BagNets, we reconstructed shallower encoders employing 3D convolutional layers. The goal of configuring an encoder network E^s_φ is to represent local features extracted from patches of size s × s × s. In addition, the encoder network is configured so that the number of extracted patches and their center positions are the same regardless of patch size, which allows a fair comparison of the changes by size. Specifically, the encoder network E^s_φ consists of a convolutional block, max pooling, and four residual blocks, as illustrated in Fig. 1. All convolutional blocks in this study include sequential operators of the convolutional layer, instance normalization layer, and rectified linear unit (ReLU) activation function. The kernel and stride sizes for the first convolutional block are set to 5 × 5 × 5 and 2 × 2 × 2, respectively. For the following max-pooling layer, the kernel size is 3 × 3 × 3, and the stride size is 2 × 2 × 2. The feature maps yielded by max pooling have a receptive field size of 9 × 9 × 9. Thus, if the receptive field size does not increase further, the encoder can extract features with a specific receptive field size of 9 × 9 × 9 from the whole brain.
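The value of 9 follows from the standard receptive-field recurrence for stacked convolution and pooling layers. As a worked check (the symbols r_l, j_l, k_l, and s_l are notation introduced here for illustration), let r_l be the receptive field size and j_l the cumulative stride after layer l, with r_0 = j_0 = 1, kernel size k_l, and stride s_l:

```latex
\begin{aligned}
r_l &= r_{l-1} + (k_l - 1)\, j_{l-1}, & j_l &= j_{l-1}\, s_l, \\
r_1 &= 1 + (5 - 1)\cdot 1 = 5, & j_1 &= 1 \cdot 2 = 2 && \text{(convolutional block, } k = 5,\ s = 2\text{)}, \\
r_2 &= 5 + (3 - 1)\cdot 2 = 9, & j_2 &= 2 \cdot 2 = 4 && \text{(max pooling, } k = 3,\ s = 2\text{)}.
\end{aligned}
```

A point-wise (1 × 1 × 1) convolution has k = 1 and therefore leaves r unchanged.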
With nine as the minimum size, we constructed encoders for five receptive field sizes by varying the following residual blocks. The residual blocks were modified from the basic residual block used in ResNet18 (Korolev et al., 2017). Each residual block takes an argument k ∈ {1, 3}, representing a k × k × k kernel size for its first convolutional layer, as described in Fig. 1. The other convolutional layers in each residual block are point-wise convolutions, which increase the complexity of the representations without increasing the receptive field size. For the four sequential residual blocks that take [1, 1, 1, 1] as arguments, the receptive field size is not increased, and E^9_φ denotes the encoder. For E^17_φ, the sequential residual blocks take k as [3, 1, 1, 1]. Likewise, E^25_φ, E^41_φ, and E^57_φ take k as [3, 3, 1, 1], [3, 3, 3, 1], and [3, 3, 3, 3], respectively. For the normalization layer and nonlinear activation function in the residual blocks, the instance normalization layer and ReLU are employed, respectively. Different kernel sizes in residual blocks can change the total number of local features and the center positions of their receptive fields, which might act as additional variables that affect decision-making. Therefore, for a fair comparison by patch size, all convolutional layers in the encoder network use replication padding whose width is the integer quotient of the kernel size divided by two.
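The five receptive field sizes can be reproduced with a short pure-Python calculation. This is a sketch under one explicit assumption that the text does not state: the first convolution of the second residual block is taken to use a stride of 2 (`BLOCK_STRIDES` below), which was chosen only because it reproduces the stated sizes 9, 17, 25, 41, and 57. The function and constant names are hypothetical.

```python
def receptive_field(layers):
    """Return (receptive field size, cumulative stride) for (kernel, stride) layers."""
    size, jump = 1, 1
    for kernel, stride in layers:
        size += (kernel - 1) * jump
        jump *= stride
    return size, jump

# Stem as described in the text: 5x5x5 convolution (stride 2),
# followed by 3x3x3 max pooling (stride 2).
STEM = [(5, 2), (3, 2)]

# Assumption (not stated in the text): the first convolution of the second
# residual block uses stride 2. Point-wise layers are omitted because a
# kernel of size 1 never enlarges the receptive field.
BLOCK_STRIDES = (1, 2, 1, 1)

def encoder_receptive_field(block_kernels):
    """Receptive field of an encoder given the k argument of each residual block."""
    blocks = list(zip(block_kernels, BLOCK_STRIDES))
    return receptive_field(STEM + blocks)[0]

variants = {
    "E9": [1, 1, 1, 1],
    "E17": [3, 1, 1, 1],
    "E25": [3, 3, 1, 1],
    "E41": [3, 3, 3, 1],
    "E57": [3, 3, 3, 3],
}
sizes = {name: encoder_receptive_field(k) for name, k in variants.items()}
```

Because the strides do not depend on k in this sketch, every variant produces the same grid of patch centers, which matches the stated goal of comparing patch sizes fairly.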
Overall, we obtain a patch-level feature representation $X \in \mathbb{R}^{w \times h \times d \times f}$ for an individual MRI scan $X$. Patch-level features are represented in feature maps, where $w$, $h$, and $d$ denote the spatial size of the feature maps, and $f$ is the number of feature maps. Therefore, $w \times h \times d$ patches are embedded in an $f$-dimensional vector space. Then, the classifier network $C_{\psi}$ converts each $f$-dimensional vector into a scalar patch-level response and produces $X \in \mathbb{R}^{w \times h \times d \times 1}$. As both the encoder and classifier networks extract local responses within a specific receptive field size and share the extracting function over all spatial dimensions, each local response is extracted from a 3D patch without considering its position in the brain.

Position-based Gate for AD-related Brain-region Localization
As all MRI scans were aligned to a 3D template in the image processing step, a single 3D space is shared across the samples. Moreover, all 3D patches are single-scale patches with cubic shapes. Thus, the patches distributed over the entire brain can be differentiated solely by their position, and the center position of a patch serves as its representative position information. The simplest method to indicate positions is a one-hot representation. However, this approach can be inefficient due to the large number of patches, and it ignores the volumetric position in 3D space. This problem can be addressed efficiently using the Cartesian coordinate system. Inspired by the representation proposed in (Liu et al., 2018c), we constructed a coordinate representation that specifies a 3D Cartesian space. The 3D Cartesian space coordinates are represented in three channels, as $I \in \mathbb{R}^{W \times H \times D \times 3}$. Each coordinate channel is a rank-1 tensor, such as $I_{\mathrm{Coronal}} \in \mathbb{R}^{W \times H \times D \times 1}$, whose values are filled with 1's in the first coronal plane, 2's in the second coronal plane, 3's in the third coronal plane, and so on. The values, ranging from 1 to the number of coronal planes, are then normalized to lie between -1 and 1. Similarly, for the $I_{\mathrm{Sagittal}}$ and $I_{\mathrm{Axial}}$ coordinate channels, the values are filled along the sagittal and axial planes, respectively. These three coordinate channels are concatenated into a tensor $I \in \mathbb{R}^{W \times H \times D \times 3}$, which completely specifies each position in the 3D Cartesian space.
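As an illustration of the coordinate channels described above, the sketch below fills each axis with plane indices 1, 2, 3, and so on, and rescales them linearly to [-1, 1]. It uses nested Python lists for readability; which array axis corresponds to the coronal, sagittal, or axial direction is an assumption of this sketch, as are the function names.

```python
def normalized_axis(n):
    """Plane indices 1..n rescaled linearly to the interval [-1, 1]."""
    if n == 1:
        return [0.0]  # a single plane has no extent; place it at the center
    return [-1.0 + 2.0 * (p - 1) / (n - 1) for p in range(1, n + 1)]


def coordinate_tensor(W, H, D):
    """Nested list of shape W x H x D x 3 with one coordinate triple per voxel.

    ASSUMPTION: the first, second, and third array axes play the roles of the
    three anatomical plane directions; the text does not fix this mapping.
    """
    xs, ys, zs = normalized_axis(W), normalized_axis(H), normalized_axis(D)
    return [[[(xs[i], ys[j], zs[k]) for k in range(D)]
             for j in range(H)]
            for i in range(W)]


if __name__ == "__main__":
    I = coordinate_tensor(5, 3, 2)
    print(I[0][0][0])  # (-1.0, -1.0, -1.0): first plane along every axis
    print(I[4][2][1])  # (1.0, 1.0, 1.0): last plane along every axis
    print(I[2][1][0])  # (0.0, 0.0, -1.0): middle planes map to 0
```

Because every voxel carries its own triple, two patches with identical appearance but different locations receive different inputs in the position branch.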
Based on the position information in this space, we obtain the center position information of the patches used in the patch-level prediction branch. The center position information is represented as a position indicator I. The extraction of the position indicator is described in Fig. 2b. First, when data augmentation such as image translation and cropping is used, the representation of the 3D Cartesian space I should be transformed in the same way as the input, so that it has the same spatial dimensions as the input MRI scan. Then, following the structure of the encoder network, the center positions of the receptive fields are extracted hierarchically. The extraction is performed by depth-wise convolution with fixed, nonparametric kernel weights: all weights in the kernel are set to 0 except the center weight, which is set to 1.
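The center-position extraction can be illustrated along a single axis. The sketch below applies a fixed kernel that is 1 at its center and 0 elsewhere, with replication padding of kernel // 2 and the same kernel and stride sizes as the encoder stem. Applying the same padding to the pooling stage is an assumption of this sketch, and `center_pick` is a hypothetical helper name.

```python
def center_pick(values, kernel, stride):
    """1D convolution with a fixed one-hot kernel (1 at the center, 0 elsewhere).

    Replication padding of kernel // 2 is applied on both sides, so each output
    equals the coordinate at the center of the corresponding receptive field.
    """
    pad = kernel // 2
    padded = [values[0]] * pad + list(values) + [values[-1]] * pad
    weights = [0.0] * kernel
    weights[pad] = 1.0  # only the center position contributes
    out = []
    for start in range(0, len(padded) - kernel + 1, stride):
        window = padded[start:start + kernel]
        out.append(sum(w * v for w, v in zip(weights, window)))
    return out


if __name__ == "__main__":
    coords = [float(c) for c in range(9)]                      # coordinates along one axis
    after_conv = center_pick(coords, kernel=5, stride=2)       # mirrors the first conv block
    after_pool = center_pick(after_conv, kernel=3, stride=2)   # mirrors max pooling
    print(after_conv)  # [0.0, 2.0, 4.0, 6.0, 8.0]
    print(after_pool)  # [0.0, 4.0, 8.0]
```

Repeating this step for every layer of the encoder yields one coordinate per topmost feature, that is, the center of each patch.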
In the end, the extracted position indicator $I \in \mathbb{R}^{w \times h \times d \times 3}$ is taken as input to the position-based gating branch. This branch generates translation-dependent outcomes, which is not possible in the patch-level prediction branch. As described in Fig. 1, the parametric functions of the position embedding network $P_{\pi}$ and gate network $G_{\rho}$ consist of convolutional layers, and all convolutional layers in both networks are point-wise convolutions, parameterized by $\pi$ and $\rho$, respectively. In the position embedding network, the semantic feature representation $\hat{I}$ is extracted by increasing the number of feature maps, in order to detect the task-oriented discriminative regions. In the gate network, the number of output feature maps is then decreased to encode the semantic feature representation. Finally, the remaining feature maps are averaged and passed through the sigmoid activation function to generate the discriminative probability map $G = (g_{1,1,1}, \cdots, g_{i,j,k}, \cdots, g_{w,h,d})$. The discriminative probability map consists of $g_{i,j,k} \in [0, 1]$, representing the position-based response at $(i, j, k)$. Because the position indicator I is constructed from the coordinates in the 3D Cartesian space, each patch has an absolute position that is shared across MRI scans. Therefore, the trained position embedding and gate networks produce high responses in the regions where AD-related morphological changes are consistently captured.
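Since every layer in the gating branch is a point-wise convolution, the branch acts as the same small function applied independently to each patch position. The sketch below shows this for a single position: the channel count is first increased, then decreased, and the remaining channels are averaged and passed through a sigmoid. The layer widths, the weights, and the use of ReLU between the two stages are illustrative assumptions and are not taken from the text.

```python
import math


def pointwise_linear(vec, weights, bias):
    """A 1x1x1 convolution seen from one position: a plain affine map."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def gate_value(position, embed_w, embed_b, gate_w, gate_b):
    """Map one (x, y, z) position in [-1, 1]^3 to a gate value in (0, 1)."""
    embedded = relu(pointwise_linear(position, embed_w, embed_b))  # 3 -> more channels
    reduced = pointwise_linear(embedded, gate_w, gate_b)           # fewer channels
    mean = sum(reduced) / len(reduced)                             # average remaining maps
    return 1.0 / (1.0 + math.exp(-mean))                           # sigmoid


# HYPOTHETICAL weights, for illustration only (3 -> 4 -> 2 channels).
EMBED_W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, -1.0, -1.0]]
EMBED_B = [0.0, 0.0, 0.0, 0.5]
GATE_W = [[1.0, -1.0, 0.5, 0.0], [0.0, 0.5, -0.5, 1.0]]
GATE_B = [0.0, 0.0]

if __name__ == "__main__":
    g = gate_value((-1.0, 0.0, 1.0), EMBED_W, EMBED_B, GATE_W, GATE_B)
    print(round(g, 4))
```

The gate value depends only on the position, not on the image content, so the same map G is shared by all scans once the networks are trained.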

Gate-based Pooling for Image-level Prediction
By considering a 3D whole brain to be a bag and considering the local features extracted from 3D patches distributed in the whole brain to be instances, the proposed framework can be considered an MIL framework.
In conventional MIL-based classification problems, nonparametric operators (e.g., max and mean) have been widely used to aggregate instance-level representations into a bag-level representation. Just as (Brendel and Bethge, 2019) introduced GAP to aggregate patch-level responses, the mean operator has been used as a representative aggregation function, especially when more than one instance is needed to identify a bag. The mean operation directly calculates the image-level response $z$ as follows:

$$z = \frac{1}{w\,h\,d} \sum_{i=1}^{w} \sum_{j=1}^{h} \sum_{k=1}^{d} x_{i,j,k}, \tag{1}$$

where $x_{i,j,k}$ denotes the patch-level response at position $(i, j, k)$. A parametric operator constructed by a neural network was proposed in (Ilse et al., 2018) to detect key instances and aggregate the responses based on them. Inspired by this aggregation method, we define the image-level response by aggregating patch-level responses through the position-based outcomes of the gate network.
The element-wise multiplication between the patch-level responses $X$ and the discriminative probability map $G$ results in the patch-level class evidence $E \in \mathbb{R}^{w \times h \times d \times 1}$. The total amount of discriminative brain region is unknown; thus, normalization is performed by the sum of the discriminative probability map so that the amount of the gated regions is independent of the diagnostic results. The aggregation of patch-level class evidence infers image-level abnormality and is defined as follows:

$$z = \frac{\sum_{i,j,k} e_{i,j,k}}{\sum_{i,j,k} g_{i,j,k}}, \tag{2}$$

where $e_{i,j,k} = g_{i,j,k}\, x_{i,j,k}$. The image-level response $z$ is converted directly into the posterior probability $\hat{y} = p(y \mid X)$ using the sigmoid activation function. The patch-level class evidence $E$ directly reveals which patches made a significant contribution to the final decision, making the model transparent and interpretable.
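To make the difference between mean pooling and gate-based pooling concrete, the following minimal sketch works on flattened lists of patch-level responses and gate values. It assumes, as described above, that the class evidence is the product of gate and response and that the sum of evidence is normalized by the sum of the gate values; the numbers and function names are illustrative only.

```python
import math


def mean_pooling(responses):
    """Image-level response as the plain average of patch-level responses."""
    return sum(responses) / len(responses)


def gate_pooling(responses, gates):
    """Image-level response from gated evidence, normalized by the gate mass.

    Returns (z, evidence), where evidence[i] = gates[i] * responses[i].
    """
    evidence = [g * x for g, x in zip(gates, responses)]
    z = sum(evidence) / sum(gates)
    return z, evidence


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    responses = [2.0, -1.0, 0.5, -3.0]   # hypothetical patch-level responses
    gates = [0.9, 0.1, 0.5, 0.0]         # hypothetical gate values in [0, 1]
    z, evidence = gate_pooling(responses, gates)
    print("mean pooling :", mean_pooling(responses))
    print("gate pooling :", z, "-> probability", sigmoid(z))
    print("evidence     :", evidence)
```

In this toy case, the strongly negative response at a position with a gate value of 0 dominates the plain mean but does not influence the gated result, and each entry of `evidence` shows how much its patch contributed.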

Joint Learning of Pathological Brain-region Localization and Disease Identification
The overall parameters (i.e., $\phi$, $\psi$, $\pi$, and $\rho$) are trained based on the image-level classification objective. To improve generalization, the proposed models were trained using two additional techniques, label smoothing and balanced cross-entropy, following prior studies (Müller et al., 2019; Lin et al., 2017; He et al., 2019). In the classification loss function, $y^{LS} \in \{0.1, 0.9\}$ is the modified (label-smoothed) target, and $\beta$ is a hyperparameter addressing the imbalanced classification problem. In addition, $\beta$ was set according to the inverse class frequency; precisely, it was calculated as the number of samples with negative annotation ($y = 0$) divided by the total number of samples. The gradient generated by the classification loss updates the parameters $\phi$, $\psi$, $\pi$, and $\rho$.
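One common way to combine label smoothing with a class-balanced binary cross-entropy is sketched below. The smoothed targets {0.1, 0.9} and the definition of β as the fraction of negative samples follow the description above. The exact placement of β and (1 − β) in front of the two terms is an assumption of this sketch, written in the usual balanced cross-entropy form, and may differ from the authors' formulation.

```python
import math


def smoothed_target(y, eps=0.1):
    """Label smoothing for a binary label: 1 -> 0.9 and 0 -> 0.1 when eps = 0.1."""
    return 1.0 - eps if y == 1 else eps


def balanced_bce(y_hat, y, beta, eps=0.1):
    """Label-smoothed, class-balanced binary cross-entropy for one sample.

    ASSUMPTION: beta weights the positive term and (1 - beta) the negative term.
    """
    y_ls = smoothed_target(y, eps)
    positive_term = beta * y_ls * math.log(y_hat)
    negative_term = (1.0 - beta) * (1.0 - y_ls) * math.log(1.0 - y_hat)
    return -(positive_term + negative_term)


if __name__ == "__main__":
    labels = [1, 0, 0, 0]                   # hypothetical imbalanced training labels
    beta = labels.count(0) / len(labels)    # negatives / total = 0.75
    print("beta:", beta)
    print("confident and correct:", balanced_bce(0.9, 1, beta))
    print("confident and wrong  :", balanced_bce(0.1, 1, beta))
```

With β computed this way, the rarer positive class receives the larger weight, which counteracts the imbalance between classes.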
Moreover, the element-wise multiplication between $G$ and $X$ makes the forward and backward propagation of the two branches highly dependent on each other. However, in the early stages of training, both $X$ and $G$ are produced by randomly initialized parameters. To encourage the framework to explore discriminative brain regions more thoroughly, we employ an entropy loss that maximizes the entropy of $G$. The gradient generated by the entropy loss affects only the parameters $\pi$ and $\rho$. The final total loss function combines the classification loss and the entropy loss, weighted by a hyperparameter $\lambda$, and the proposed network is trained with it in an end-to-end manner.

Experimental Settings, Results, and Analysis

Comparative Deep Learning Methods
Three sMRI-based deep learning approaches were adopted to compare the proposed method with state-of-the-art methods. These approaches, referred to as the 3D CNN, attention-based 3D ResNet, and HybNet, were reimplemented in PyTorch and trained on the same data as the proposed method. The compared methods are summarized as follows:
• 3D CNN (Liu et al., 2020): The CNN-based classifier was trained end-to-end to classify disease states without anatomical prior knowledge or a localization method. For a fair comparison, we adopted the model architecture proposed in that work without clinical information, such as patients' age. Given a randomly cropped 3D image as input, sequential convolutional blocks extract local features, and the output feature maps are flattened. The flattened vector is used as input to the classifier. The size of the randomly cropped images was set identically to that of the proposed method. Based on this setting, we compared our method with models trained with a widening factor (WF) of 1 and 2.
• Attention-based 3D ResNet (Jin et al., 2020): This method has a goal similar to ours: jointly learning AD-related brain-region detection and disease identification in an end-to-end manner. However, two underlying differences exist. One is that the AD-related brain regions are detected based on local features. The other is the nonlinear interaction between weighted local features in image-level decision-making. Specifically, the model takes the 3D whole brain as input. The 3D residual blocks are stacked for local feature extraction. The local features are pooled using GAP, followed by a fully connected layer for classification. An additional attention module is attached in the middle of the feature extraction to detect the AD-related brain regions. This module generates spatial attention weights based on local features extracted in the middle of the network, and the attention is applied to the local features.
• Hybrid network (HybNet) (Lian et al., 2020): The HybNet consists of two branches constructed to capture 1) global structural information and 2) local structural information. First, the FCN backbone was trained to generate the DAM and mean DAM, which are used to train the global branch (GB) and local branch (LB), respectively. The DAM was directly used as the attention in training the GB, whereas the mean DAM was used to determine the brain regions from which patches are extracted. The LB was trained on the patches extracted from these predetermined brain regions. The subsequent pruning and fine-tuning steps were performed as described in the literature. Finally, the two discriminative feature vectors obtained using the GB and LB were concatenated and used as input for training the fusion branch. The fusion branch comprises two consecutive fully connected layers followed by ReLU activation.
BrainBagNet-s processes a single patch size s, and its patch-level responses are aggregated using the GAP operation. FG-BrainBagNet denotes a BrainBagNet with a feature-based gate inspired by an attention-based MIL framework (Ilse et al., 2018): instead of position information, local features generated by the encoder network are used as input to the gate network. Lastly, PG-BrainBagNet is the proposed method.
In the AD diagnosis task, the best classification performance in our experimental setting was achieved by PG-BrainBagNet-41, which outperformed the state-of-the-art methods in classification accuracy. The largest and smallest margins in mean accuracy were 8.33% (vs. 3D CNN (WF:2)) and 3.65% (vs. HybNet (GB)), respectively. In addition, the position-based gate method increased classification performance compared with BrainBagNets. In particular, the position-based gate method increased the classification performance of BrainBagNet with a small patch size by a large margin, whereas the feature-based gates did not. Specifically, for PG-BrainBagNet-9, the accuracy increased by 13.13% (vs. BrainBagNet-9). Although the classification performance of BrainBagNets increased as the patch size increased, PG-BrainBagNets did not exhibit significant differences according to patch size. Therefore, BrainBagNets might have performed poorly with a small patch size because the whole-brain image contains many patches unrelated to the brain disease. In addition, the improvement in classification performance indicates that the AD-related regions were effectively gated by the proposed method.
In the MCI conversion prediction task, the best prediction result was achieved by PG-BrainBagNet-9. Compared with the state-of-the-art methods, the maximum and minimum differences in mean accuracy were 3.99% (vs. 3D CNN (WF:2)) and 1.33% (vs. attention-based 3D ResNet and HybNet (fusion)), respectively. In terms of patch size, the classification performance of BrainBagNets increased as the patch size increased. Additionally, the feature-based gate methods could not improve the classification results. However, the position-based gate method yielded improvements when a small patch size was used, especially when the patch size was 9, 17, or 25. Furthermore, its classification results consistently improved as the patch size was reduced. As the receptive field size was limited, the local feature representations were forced to capture local brain changes rather than global structural changes. These results imply that position-based brain-region localization provides highly informative results. In addition, we observed the importance of capturing subtle changes for the early detection of AD. The MCI conversion prediction performance of the proposed models trained from scratch and of those trained through transfer learning is compared in Supplementary A.

Result of Discriminative Brain-region Localization
We analyzed discriminative probability maps G produced by the proposed position-based gating branch.
For better visualization, linear interpolation was performed and overlaid with the MNI template.The changes in the discriminative probability map by the learning epoch are described in Fig. 3, and the results for the

Identified Patch-level Class Evidence
The proposed method performs image-level decision-making by transparently aggregating patch-level class evidence. Thus, higher patch-level class evidence results in a greater contribution to the prediction. As patch-level class evidence is created from individual 3D MRI scans, it provides an individualized analysis of AD progression. To analyze the local evidence for the positive classes (AD and pMCI), positive values were extracted, and linear interpolation was applied. The obtained 3D image was overlaid on the input image. The results for one sample per disease state are shown in Fig. 5. First, we observed more positive class evidence as the disease progressed. We also observed that the overall patch-level class evidence was consistently captured in the hippocampal region, temporal lobe, and parietal lobe. However, even with the same 3D image as input, apparent differences arise depending on the patch size used in model training. When the patch size was small, the trained model could detect local class evidence in the subtle brain atrophy around the hippocampal and parietal lobe areas. The model trained using large patches seemed to detect cues from relatively coarse structural changes compared with the model trained using small patches. Results for a larger number of coronal, sagittal, and axial planes are presented in Supplementary C. In addition, visualizations of patch-level class evidence for more patch sizes and additional samples are provided in Supplementary D.

Quantitative Analysis of Discriminative Regions by Patch Size
Existing AD analyses using patch-level feature representation have had to extract discriminative patches before performing feature extraction for diagnostic model training. Moreover, patch extraction requires adjusting various hyperparameters, such as the patch size and the number of patches to extract. Although these hyperparameters are highly correlated, this has been ignored in the patch extraction process. For instance, as the patch size increases, the number of patches containing discriminative features increases. Unlike previously proposed patch extraction methods (Tong et al., 2014; Suk et al., 2014; Lian et al., 2020), the proposed method learns and probabilistically indicates the importance of overlapping, widely distributed patches in the whole brain. We can therefore perform a quantitative analysis of masked brain regions using a threshold between 0 and 1. The region where the values were higher than a given threshold was masked, and the proportion of masked regions was calculated. As described in Fig. 6, as the threshold increased from 0 to 1, the proportion of masked regions decreased sharply for both prediction tasks (AD vs. NC and pMCI vs. sMCI). When the thresholds were either small or large, the number of remaining patches varied according to patch size. For example, at a threshold of 0.5, the masked regions of the AD diagnosis model with a patch size of 9 and of the model with a patch size of 57 differed by a factor of about five. These results reveal the relative sparsity of the discriminative brain regions when dealing with small patches in both AD diagnosis and MCI conversion prediction. Furthermore, this sparsity explains why extracting small discriminative patches is both challenging and vital.
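The thresholding analysis described above amounts to counting the fraction of gate values that exceed a threshold. A minimal sketch is given below; the gate values are made up for illustration, and `masked_proportion` is a hypothetical helper name.

```python
def masked_proportion(gates, threshold):
    """Fraction of positions whose gate value is strictly above the threshold."""
    return sum(1 for g in gates if g > threshold) / len(gates)


if __name__ == "__main__":
    # Hypothetical flattened discriminative probability map.
    gates = [0.05, 0.2, 0.45, 0.55, 0.8, 0.95]
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"threshold {t:.2f}: proportion {masked_proportion(gates, t):.3f}")
```

Sweeping the threshold from 0 to 1 and plotting the resulting proportions for each patch size gives curves of the kind summarized in Fig. 6.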

Stochastic Representation of Discriminative Regions
To analyze the stochastic representation of discriminative brain regions, we evaluated the classification performance of the trained PG-BrainBagNet by employing Monte Carlo dropout (Gal and Ghahramani, 2016). By randomly dropping $g_{i,j,k}$ with a probability of $p_{i,j,k} \in \{0.0, 0.5, g_{i,j,k}, (1 - g_{i,j,k})\}$, we generated four types of randomly dropped $G \in \mathbb{R}^{w \times h \times d \times 1}$. Based on the dropped $G$, we inferred $\hat{y}$ for image-level prediction. The final decision was made by soft voting over the results of 100 trials. The classification accuracy observed in the five-fold cross-validation is listed in Table 4.
First, when the drop probability is 0, the results are identical to those without Monte Carlo dropout, which are presented in Tables 2 and 3. When the drop probability was increased to 0.5, the prediction was made using randomly dropped patch-level class evidence. The best classification results were obtained with patch sizes of 41 and 9 in the AD diagnosis and MCI conversion prediction tasks, respectively. Although the decision was made using randomly dropped local class evidence, the soft voting results show classification accuracy equivalent to that with a drop probability of 0. When the drop probability was set to $g_{i,j,k}$, the model lost, with high probability, the local class evidence appearing in the discriminative brain regions. Therefore, regardless of the patch size, the classification result was close to the chance level of binary classification. Lastly, when the drop probability was $1 - g_{i,j,k}$, the model lost the local class evidence appearing in the discriminative brain regions only with low probability. However, the evidence from regions with relatively low $g_{i,j,k}$, such as the parietal and frontal lobes, might be dropped with high probability. In the AD diagnosis task, we observed that losing the information generated in these regions decreased the classification accuracy when small patches were used. We speculate that the subtle changes captured in the parietal and frontal lobes could provide informative cues for AD diagnosis. In the MCI conversion prediction task, we observed increased classification accuracy when patch sizes of 9, 25, and 41 were used. This result implies that the discriminative region representation $G$ might contain uncertainty across subjects and that estimating the posterior probability of $y$ using Bayesian approaches could lead to better performance in predicting MCI conversion.
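The Monte Carlo dropout procedure on the gate values can be sketched as follows. Each gate value is dropped (set to 0) with one of the four probabilities described above, the image-level probability is computed from the remaining gates, and the probabilities from 100 trials are averaged as soft voting. Renormalizing by the sum of the remaining gates and returning 0.5 when every gate is dropped are assumptions of this sketch, as are the function names.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(responses, gates):
    """Image-level probability from gated, normalized patch-level evidence."""
    total = sum(gates)
    if total == 0.0:
        return 0.5  # ASSUMPTION: no evidence left, so return an uninformative value
    z = sum(g * x for g, x in zip(gates, responses)) / total
    return sigmoid(z)


def drop_probability(g, mode):
    """The four drop probabilities considered: 0, 0.5, g, and 1 - g."""
    return {"zero": 0.0, "half": 0.5, "g": g, "one_minus_g": 1.0 - g}[mode]


def mc_dropout_predict(responses, gates, mode, trials=100, seed=0):
    """Soft voting: average the predicted probability over random gate dropouts."""
    rng = random.Random(seed)
    probabilities = []
    for _ in range(trials):
        dropped = [0.0 if rng.random() < drop_probability(g, mode) else g
                   for g in gates]
        probabilities.append(predict(responses, dropped))
    return sum(probabilities) / trials


if __name__ == "__main__":
    responses = [2.0, -1.0, 0.5, -3.0]   # hypothetical patch-level responses
    gates = [0.9, 0.1, 0.5, 0.0]         # hypothetical gate values
    for mode in ("zero", "half", "g", "one_minus_g"):
        print(mode, round(mc_dropout_predict(responses, gates, mode), 4))
```

With the drop probability set to g, high-gate positions are removed most often, which mirrors the setting in which accuracy falls toward chance level.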

Ablation Study for Localization Methods
We performed an ablation study of localization methods within our proposed framework to evaluate the effectiveness of jointly learning discriminative brain-region localization and disease identification. We compared four localization methods: "w/o", "mean DAM", "mean DAM (0.3)", and "end-to-end". Here, "w/o" denotes BrainBagNets, and "end-to-end" denotes the proposed models, that is, PG-BrainBagNets. Both "mean DAM" and "mean DAM (0.3)" are models trained with a predetermined discriminative brain region, inspired by the mean DAM introduced in (Lian et al., 2020). For "mean DAM", the proposed framework was trained using a predetermined G, taking the mean DAM as G instead of training the position embedding and gate networks. In addition, because the mean DAM was not used as a probabilistic value in (Lian et al., 2020) but was instead used for extracting patches, we also generated a binary mask. In that work, a threshold of 0.3 was used in patch extraction to represent potential patch locations. We obtained a binary mask based on this threshold, and model training was performed in the same way as for "mean DAM"; the resulting model is denoted as "mean DAM (0.3)". The predetermined G for "mean DAM" and "mean DAM (0.3)" is illustrated in Fig. 7a.
The average accuracy and AUROC in five-fold cross-validation are shown in Figs. 7b and 7c. First, when the model was trained without brain-region localization, classification performance decreased as the patch size was reduced. The model trained using the smallest patches exhibited the lowest classification performance for both tasks in terms of accuracy and AUROC. Adding the predetermined localization method improved classification performance compared with using no localization method. However, because this localization was performed independently of the diagnosis model, it resulted in worse classification than the proposed method. In the MCI conversion prediction task, only the proposed method demonstrated increased classification performance when the patch size was limited. This result implies that restricting the patch size allows AD-related local and subtle changes to be extracted but requires suitable brain-region localization that depends on the diagnosis model.

Regions of Disease Progression
The proposed method was constructed to transparently aggregate patch-level responses for image-level predictions. Thus, individualized AD progression regions were detected in a weakly supervised manner. To analyze the detected regions where AD progression occurs, we examined the patch-level responses over the testing dataset. For correctly predicted AD samples, patch-level class evidence was obtained, and only positive values were retained to obtain the class evidence for the AD class. The results were aggregated over the testing datasets in five-fold cross-validation. We calculated the point-wise mean and standard deviation, and the results were upsampled to the size of the MNI template and visualized in Fig. 8.
We observed that high mean values were located in the hippocampal regions regardless of the patch size used in the model. Additionally, the area where high standard deviation values were distributed was similar to that where high mean values were distributed. The high responses in the standard deviation map reveal that the regions where high patch-level class evidence appeared varied considerably, even across AD subjects.
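The aggregation behind these statistics can be sketched in a few lines: negative evidence is clipped to zero for each correctly predicted AD sample, and the mean and standard deviation are then taken position by position across samples. The sketch uses flattened evidence maps and the population standard deviation; both choices, the numbers, and the function names are assumptions for illustration.

```python
import math


def positive_part(evidence):
    """Keep only evidence for the positive class; negative values become 0."""
    return [max(0.0, e) for e in evidence]


def pointwise_mean_std(maps):
    """Position-wise mean and population standard deviation across samples."""
    n = len(maps)
    means, stds = [], []
    for values in zip(*maps):   # iterate over positions
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / n
        means.append(mean)
        stds.append(math.sqrt(variance))
    return means, stds


if __name__ == "__main__":
    # Hypothetical flattened class-evidence maps of two true-positive samples.
    sample_a = positive_part([0.2, -0.1, 0.0])
    sample_b = positive_part([0.4, 0.3, -0.5])
    means, stds = pointwise_mean_std([sample_a, sample_b])
    print("mean:", means)
    print("std :", stds)
```

Positions with a high mean and a high standard deviation are those where strong evidence appears often but with varying strength from subject to subject.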

Effectiveness of Subtle Changes Captured using Small Patches
We demonstrated that a model extracting local class evidence with a small receptive field size better predicts MCI conversion. We analyzed two pMCI samples that were false negatives for the model trained using large patches but were predicted correctly when the patch size was limited. The local class evidence according to patch size is shown in Fig. 9. The first row depicts the original sMRI scans with image IDs 26442 and 31799. The following rows show the patch-level class evidence according to the patch size. For easier comparison, the areas containing a high amount of class evidence are marked with dashed rectangles and denoted as R1 to R4. The blue and red colors indicate high class evidence for the sMCI and pMCI classes in that region, respectively. The model trained using small patches displayed fine class evidence detected in local regions, whereas the models trained using large patches exhibited coarse patch-level class evidence.
First, at the bottom of the R1 region, we observed that Sample #26442 exhibits brain atrophy in the temporal lobe rather than the hippocampus, in contrast to Sample #31799, which exhibits brain atrophy located in the hippocampal area. These atrophies were correctly captured by the model trained using small patches. However, the estimates produced by the models trained using larger patches show the difficulty of capturing these subtle changes. These patterns can be observed in the R1, R2, and R3 regions. The positive class evidence for the pMCI class located in the parietal lobe area was captured only by the model trained using small patches, as can be observed by comparing the R4 region.
Finally, the model trained using small patches could gather sufficient local evidence to correctly predict the MCI conversion, whereas the models trained using large patches could not capture sufficient cues for a correct decision. In this analysis, we observed that restricting the patch size increased the prediction performance for MCI conversion by extracting subtle and local structural feature representations.

Conclusion
sMRI-based deep learning approaches have been widely used and continue to evolve. Many studies have focused on subtle brain atrophy to better understand AD biomarkers in sMRI. As not all subtle structural changes are associated with brain disease, discriminative brain-region localization has attracted attention. To alleviate the problem of brain-region localization for patch-level feature representation, we proposed a framework for jointly learning discriminative brain-region localization and disease identification in an end-to-end manner. In the experiments, we evaluated the proposed method on the ADNI dataset. The proposed method achieved the best classification performance in both the AD diagnosis and MCI conversion prediction tasks compared with existing methods. In particular, the proposed method effectively increased the classification performance when localization of subtle changes was required. We also demonstrated the interpretability of the proposed method by tracing the rationale for the model predictions down to the small patch level.

Figure 1: Illustration of the proposed PG-BrainBagNet, including the encoder, classifier, position embedding, and gate network. Structural and position information is processed in two separate branches, given a magnetic resonance imaging scan and position indicator. The outcomes from the two branches are combined to represent patch-level class evidence and image-level disease probability.
augmentation, such as image translation and random cropping, can be applied in model training; thus, the spatial sizes of the 4D tensor and the input MRI scans may not be identical. The representation of the 3D space coordinates indicates the brain-region position and allows extracting position information from which patch-level responses are obtained. The extracted patch position information is denoted as a position indicator I. Then, given the position indicator, the position-based gating branch consisting of the position embedding and gate network produces position-based responses. The high responses represent the AD-related brain
(a) Three-dimensional (3D) patch $x^{s}_{i,j,k}$ comprising a 3D structural magnetic resonance imaging (sMRI) scan X on the Cartesian coordinate system.
(b) Representation of coordinates in three-dimensional Cartesian space I and the extraction of position indicator I.

Figure 2: Illustration of (a) three-dimensional (3D) patch extracted in a specific position and size and (b) the position indicator extracted from the coordinates in 3D Cartesian space.

Figure 3: Changes in the output of the proposed position-based gating branch according to the number of training epochs and patch size used in model training.

Figure 4: Difference in the output of the trained position-based gating branch depending on the task and patch size used in model training.

Figure 5: Difference in the positive patch-level class evidence produced by the proposed method according to disease state and patch size used in model training. Each row indicates one sample per disease state, where the number next to the # denotes the corresponding image ID of an input MRI scan.

Figure 6: Difference in the proportion of masked brain regions according to the task, threshold, and patch size used in model training.

Figure 7: Illustration of classification performance in terms of accuracy and area under the receiver operating characteristic (AUROC) by the patch size used in model training, localization method, and task.

Figure 8: Statistics of the positive patch-level class evidence drawn from true-positive examples.

Figure 9: Examples of false negatives from the model trained with large patches but correctly predicted by the model trained using small patches. Each column indicates one sample labeled as progressive mild cognitive impairment (pMCI), where the number next to the # denotes the corresponding image ID of an input MRI scan. In addition, the blue and red colors indicate high class evidence for the stable mild cognitive impairment and pMCI class in that region, respectively.

Table 1: Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort and its corresponding demographic information.
In addition, images converted into the AD class within 36 months were labeled as the pMCI class. In this process, MCI samples with reversion from the AD class to other classes were excluded from the dataset. Overall, in ADNI-1, there were 231, 223, 168, and 200 sMRI scans for NC, sMCI, pMCI, and AD, respectively. For ADNI-2, there were 204, 274, 83, and 159 scans in the same order. The demographics and clinical information are presented in Table

Table 3: Performance comparison on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset in mild cognitive impairment (MCI) conversion prediction (progressive MCI vs. stable MCI).

Table 4: Comparison of the classification accuracy through Monte Carlo dropout using four types of drop probability.